Exact Multiple String Matching Problem for DNA Alphabet

Authors

  • Yi-Kung Shieh
  • Shyong Jian Shyu
  • Richard Chia-Tung Lee
Abstract

Given a text T = t1t2...tn and a set of patterns P = {P1, P2, ..., Pr}, the exact multiple string matching problem (EMSMP) is to find the ending positions of all substrings of T that are equal to Pi for some 1 ≤ i ≤ r. We regard all substrings of T and the patterns in P as data points in an edit distance-based metric space. The data points of T are organized into a vantage point tree (vp-tree) 𝒯. EMSMP can then be solved by searching for every point of P in 𝒯. We further enhance 𝒯 into the vpac-tree C (a vp-tree with alliance cut capability), in which more unnecessary branches may be cut off so that search efficiency improves. Experiments on long texts and short patterns over the DNA alphabet are conducted using the two proposed schemes and m-BNDM (the multiple-pattern version of the well-known BNDM approach). The computational results demonstrate the effectiveness and efficiency of our schemes.
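
The metric-space idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the node layout, the median split, the random choice of vantage points, and building one tree per pattern length are all assumptions made here for illustration, and the alliance cut of the vpac-tree is not sketched because the abstract does not describe it. The names `edit_distance`, `Node`, `build`, `find_exact`, and `match_all` are hypothetical. An exact match is a point at edit distance 0 from the query, so it lies at the same distance from every vantage point as the query does, and the search only ever needs to follow one branch.

```python
import random
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]


class Node:
    """One vp-tree node (assumed layout): a vantage point, the ending
    positions of its occurrences in the text, and a distance threshold."""
    def __init__(self, point, ends, mu, inside, outside):
        self.point, self.ends = point, ends
        self.mu, self.inside, self.outside = mu, inside, outside


def build(items, rng):
    """Build a vp-tree from (substring, ending_positions) pairs.
    Substrings are assumed to be distinct."""
    if not items:
        return None
    items = list(items)
    point, ends = items.pop(rng.randrange(len(items)))   # random vantage point
    dists = [edit_distance(point, s) for s, _ in items]
    mu = sorted(dists)[len(dists) // 2] if dists else 0  # median threshold
    inside = [it for it, d in zip(items, dists) if d <= mu]
    outside = [it for it, d in zip(items, dists) if d > mu]
    return Node(point, ends, mu, build(inside, rng), build(outside, rng))


def find_exact(node, query):
    """Return the ending positions of `query`, or [] if it is not in the tree."""
    while node is not None:
        d = edit_distance(query, node.point)
        if d == 0:
            return node.ends
        # A point equal to `query` has the same distance d to the vantage
        # point, so it can only be in the branch that d itself falls into.
        node = node.inside if d <= node.mu else node.outside
    return []


def match_all(text, patterns, seed=0):
    """Map each pattern to the 1-based ending positions of its occurrences."""
    trees, result = {}, {}
    for p in patterns:
        m = len(p)
        if m not in trees:  # one tree per distinct pattern length (assumption)
            windows = defaultdict(list)
            for i in range(len(text) - m + 1):
                windows[text[i:i + m]].append(i + m)
            trees[m] = build(list(windows.items()), random.Random(seed))
        result[p] = find_exact(trees[m], p)
    return result
```

For example, `match_all("ACGTACGT", ["ACG", "GT", "TTT"])` returns `{'ACG': [3, 7], 'GT': [4, 8], 'TTT': []}`. For pure exact matching a hash table of substrings would be simpler; the tree is used here only to show how a distance-based index can discard branches.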

Similar resources

Selecting the Shortest Superstring in DNA Using the Particle Swarm Optimization Algorithm

A DNA string can be regarded as a very long string over a four-letter alphabet. Numerous scientists are attempting to decode this string. Since the string is very long, shorter sections of it that overlap one another are decoded instead. There is no information about the correct positions of these sections on the main DNA string. It seems that the shortest string (substring of the main DNA string) i...

Multiple Pattern Matching Revisited

We consider the classical exact multiple string matching problem. Our solution is based on q-grams combined with pattern superimposition, bit-parallelism and alphabet size reduction. We discuss the pros and cons of the various ways of achieving the best combination. Our method is closely related to previous work by Salmela et al. (2006). The experimental results show that our method p...

Evaluation and Improvement of Fast Algorithms for Exact Matching on Genome Sequences

With the availability of large amounts of DNA data, exact matching of nucleotide sequences has become an important application in modern computational biology and in metagenomics. In the last decade, several efficient solutions for the exact string matching problem have been developed, and most of them are very fast in practical cases. However, when the length of the pattern is short or the alpha...

A Fast Heuristic for Exact String Matching

Given a pattern string P of length n consisting of δ distinct characters and a query string T of length m, where the characters of P and T are drawn from an alphabet Σ of size ∆, the exact string matching problem consists of finding all occurrences of P in T. For this problem, we present a randomized heuristic that in O(nδ) time preprocesses P to identify sparse(P), a rarely occurring substri...

Project 2: Pattern Matching in Compressed DNA Sequence

Space-efficient storage of large genome sequences requires good compression techniques. However, if these sequences need to be decompressed before any processing can be done on them, the advantage of compression is lost. New techniques are required to extend the traditional pattern matching algorithms to work directly on the compressed sequence. This saves space in memory, requires less disk...

Publication date: 2016